Reorder tests in maybe_downcast_numeric #55825

MichaelTiemannOSC · 2023-11-04T02:40:35Z

The comment # if we have any nulls, then we are done is not consistent with the test if isna(arr).any() because arr is constructed only from the first element (r[0]) not the full ravel'd list of values. Moreover, calling np.array() on some random type can have surprising consequences.

So instead, do the early-out test as intended, just using r[0] without going through np.array(). Then test other things about r[0]. Only then should we test all the values (and if we have any nulls, then we are done).
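The mismatch described above can be illustrated with a small standalone sketch. It is not the pandas implementation: the variable names `result`, `r`, and `arr` only mirror the ones mentioned in the description, and the input array is an assumed example in which the null is not in the first position.

```python
import numpy as np
import pandas as pd

# Assumed example input: the null is NOT the first element.
result = np.array([1.0, np.nan, 2.0])
r = result.ravel()

# An array built from only the first element cannot see the later null.
arr = np.array([r[0]])
first_only_has_null = bool(pd.isna(arr).any())

# Checking every raveled value is what the comment actually describes.
all_values_have_null = bool(pd.isna(r).any())

print(first_only_has_null, all_values_have_null)  # False True
```

The first check reports no nulls even though the data contains one, which is the inconsistency between the comment and the test.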

See #55824

[55824] closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The comment `# if we have any nulls, then we are done` is not consistent with the test `if isna(arr).any()` because `arr` is constructed only from the first element (`r[0]`) not the full ravel'd list of values. Moreover, calling `np.array()` on some random type can have surprising consequences. So instead, do the early-out test as intended, just using `r[0]` without going through `np.array()`. Then test other things about `r[0]`. Only then should we test all the values (and if we have any nulls, then we are done). See pandas-dev#55824 Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

MichaelTiemannOSC · 2023-11-06T09:34:34Z

While there is a conversation going on in #55824, these changes introduce no new functionality and no user-visible changes (other than tolerating Pint's behavior). All they do is remove a dependency on the behavior of np.array(), whose effect can be achieved by other means, and make the implementation match the comments about what happens when NaN/NA values are encountered among the values that may be downcast.

WillAyd · 2023-11-06T14:34:04Z

Cool thanks for the PR. Can you add a test and a whatsnew note?

MichaelTiemannOSC · 2023-11-06T14:58:42Z

Would be happy to, but...the test case would require loading Pint. Last time I tried to do some test cases with alien packages I got back a big "No thanks!" from the pandas maintainers. Also, another PR I submitted (enabling many complex test cases) rejected the WhatsNew changes I wrote because there were no user-visible changes. Would you like to guide me further?

rhshadrach · 2023-11-07T03:09:38Z

pandas/core/dtypes/cast.py

+        if isna(r[0]):
+            # do a test on the first element, if it fails then we are done


This test on the first element goes back to dc73315#diff-fb8a9c322624b0777f3ff7e3ef8320d746b15a2a0b80b7cab3dfbe2e12e06daa in core.common. It appears to me it is no longer necessary.

MichaelTiemannOSC · 2023-11-07

So I constructed a simple test case and I think I understand what the earlier code was trying to do, which was to ravel only the first element of result. Consider:

```python
import numpy as np
import pandas as pd

aa = np.array([1.0, np.nan, 2.0])
bb = np.array([aa, aa])
print(aa, pd.isna(aa))
# [ 1. nan  2.] [False  True False]
print(bb, pd.isna(bb))
# [[ 1. nan  2.]
#  [ 1. nan  2.]] [[False  True False]
#  [False  True False]]
print(pd.isna(bb).any())
# True
```

The fact that pd.isna(bb).any() sees the nulls inside the sub-arrays means we don't need to ravel the whole thing to look for nulls. The reason to ravel, instead, is to look at the type of the elements, and we only need to ravel the first element to look into that. (We have already tested that there are elements to ravel.)
But...decimal.Decimal doesn't have a ravel method, which perhaps explains why the original code tried to wrap just the first element back into an array in order to ravel it.
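A quick sketch of the `decimal.Decimal` point, assuming nothing beyond NumPy and the standard library. It only shows that a bare `Decimal` has no `ravel`, while wrapping it in a one-element array makes `ravel` available; it does not claim this was the original author's reasoning.

```python
import decimal

import numpy as np

d = decimal.Decimal("1.5")

# A bare Decimal is a scalar object with no ravel method.
has_ravel = hasattr(d, "ravel")

# Wrapping it in a one-element array gives an object-dtype ndarray,
# which can be raveled and still hands back the same Decimal object.
wrapped = np.array([d])
element = wrapped.ravel()[0]

print(has_ravel, wrapped.dtype, element is d)  # False object True
```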

If the first element of `result` is an array, ravel that to get the element we will test. Otherwise use it as is. We only need to check whether `result` is all non-null once. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>
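The approach in this commit message could look roughly like the sketch below. The helper name `first_element` is hypothetical and this is not the code that landed in `pandas/core/dtypes/cast.py`; it also assumes `result` is non-empty, which the discussion says has already been checked.

```python
import decimal

import numpy as np


def first_element(result):
    """Hypothetical helper: return the first scalar-like element of result."""
    first = result[0]
    if isinstance(first, np.ndarray):
        # Nested array: ravel only this first sub-array, not all of result.
        return first.ravel()[0]
    # Scalars such as decimal.Decimal have no ravel; use them as they are.
    return first


nested = np.array([[1.0, np.nan], [2.0, 3.0]])
flat = np.array([4, 5])
decimals = np.array([decimal.Decimal("1.5")], dtype=object)

print(first_element(nested), first_element(flat), first_element(decimals))
# 1.0 4 1.5
```

Only the first element is inspected for its type; the null check over all of `result` remains a separate, single step.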

Don't use deprecated array indexing on ExtensionArrays. We now need to use `iloc`. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

jbrockmendel · 2023-11-08T23:10:34Z

pandas/core/dtypes/cast.py

-
-        elif not isinstance(r[0], (np.integer, np.floating, int, float, bool)):
+        if isinstance(result, np.ndarray):
+            element = result[0]


would result.item(0) work here? might avoid potential copies from ravel
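For context on the copy concern, here is a small sketch of when `ravel` copies and what `item(0)` returns. The arrays are assumed examples and are unrelated to the pandas code path.

```python
import numpy as np

# A C-contiguous array can be raveled as a view (no copy).
c = np.arange(6.0).reshape(2, 3)
contiguous_shares = np.shares_memory(c, c.ravel())

# Its transpose is not C-contiguous, so ravel has to copy the data.
a = c.T
transposed_shares = np.shares_memory(a, a.ravel())

# item(0) fetches just the first element as a Python scalar.
first = a.item(0)

print(contiguous_shares, transposed_shares, first)  # True False 0.0
```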

jbrockmendel · 2023-12-20

My suggestion here turns out to cause a different problem bc .item(0) is wrong for dt64 ndarrays:

```python
>>> result = np.arange(5).view("M8[ns]")
>>> result.item(0)
0
```
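To make the difference concrete, the sketch below compares plain indexing with `item(0)` on a datetime64 array. The explicit `int64` dtype and the second-resolution comparison are additions for illustration and are not part of the original comment.

```python
import datetime

import numpy as np

# int64 is requested explicitly so the 8-byte view is valid on every platform.
result = np.arange(5, dtype="int64").view("M8[ns]")

by_index = result[0]      # indexing keeps the datetime64 type
by_item = result.item(0)  # nanosecond values come back as a plain int

# With a coarser unit, item() returns a datetime.datetime instead.
coarse = np.arange(5, dtype="int64").view("M8[s]").item(0)

print(type(by_index).__name__, type(by_item).__name__)  # datetime64 int
```

Because the returned type depends on the datetime unit, an `isinstance` check on the result of `item(0)` can behave differently from the same check on `result[0]`.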

When processing a multidimensional `ndarray`, we can get the first element by calling `result.item(0)` and completely avoid the copying needed by `ravel` to get the first element that way. We can also eliminate an additional conditional check. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

MichaelTiemannOSC · 2023-11-09T16:34:00Z

So to wrap this up...I don't think it's easy to add a test case because that would involve installing Pint, which might not be very CI/CD-friendly. If there's a good way to do that, please point me to a pattern I can follow. It's easy for me to describe the change in CHANGES, but it's not really user-visible. But because there is an issue (though no test case), I could write it up anyway. Thoughts?

rhshadrach · 2023-11-10T22:51:17Z

@WillAyd

> Can you add a test and a whatsnew note?

With the current implementation in this PR, I wouldn't expect any behavior changes. You good here?

It does change for Pint because there np.array([e]) will raise for certain Pint objects e, but I don't think we should be testing for this.

rhshadrach

lgtm

mroeschke · 2023-11-22T18:21:39Z

Thanks @MichaelTiemannOSC

MichaelTiemannOSC marked this pull request as ready for review November 6, 2023 09:31

rhshadrach reviewed Nov 7, 2023

MichaelTiemannOSC added 4 commits November 7, 2023 04:56

Merge branch 'main' into Issue_55824

51ee73c

Merge branch 'main' into Issue_55824

dc9851d

Update cast.py

8c6bfe1

Don't use deprecated array indexing on ExtensionArrays. We now need to use `iloc`. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

jbrockmendel reviewed Nov 8, 2023

rhshadrach reviewed Nov 9, 2023

MichaelTiemannOSC mentioned this pull request Nov 12, 2023

Add support for UFloat in PintArray (#139) hgrecco/pint-pandas#140

Open

5 tasks

Merge branch 'main' into Issue_55824

2815572

rhshadrach approved these changes Nov 22, 2023

mroeschke added the Dtype Conversions (unexpected or buggy dtype conversions) and ExtensionArray (extending pandas with custom dtypes or arrays) labels Nov 22, 2023

mroeschke added this to the 2.2 milestone Nov 22, 2023

mroeschke approved these changes Nov 22, 2023

mroeschke merged commit 4ee86a7 into pandas-dev:main Nov 22, 2023
44 of 46 checks passed

MichaelTiemannOSC mentioned this pull request Feb 27, 2024

BUG: maybe_downcast_numeric interferes with PintPandas #55824

Closed

3 tasks
